feat: Support Parquet writer options #1123

Open
wants to merge 6 commits into
base: main

@nuno-faria nuno-faria commented May 5, 2025

Which issue does this PR close?

N/A.

Rationale for this change

Supporting all Parquet writer options gives us more flexibility when writing Parquet files directly from datafusion-python.

For consistency, this PR supports all writer options defined by ParquetOptions in datafusion, using the same defaults: https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423.

What changes are included in this PR?

  • Extended write_parquet with all writer options, including column-specific options.
  • Added relevant tests. Because pyarrow does not expose page-level information, some options, such as enabling bloom filters, could not be tested directly. For bloom filters, an external tool confirmed that the option works, and a test compares the resulting file sizes instead.
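
To illustrate how column-specific options can sit on top of file-wide ones, here is a minimal sketch in plain Python. It does not use this PR's code: the class names `WriterOptions` and `ColumnOptions`, the three fields, and the `resolve` helper are all hypothetical, and the fallback rule (an unset column value inherits the file-wide value) is an assumption made for the example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WriterOptions:
    """Hypothetical file-wide options; field names are illustrative only."""
    compression: str = "zstd(3)"
    dictionary_enabled: bool = True
    bloom_filter_enabled: bool = False


@dataclass
class ColumnOptions:
    """Hypothetical per-column overrides; None means 'use the file-wide value'."""
    compression: Optional[str] = None
    dictionary_enabled: Optional[bool] = None
    bloom_filter_enabled: Optional[bool] = None


def resolve(global_opts: WriterOptions,
            column_opts: Optional[ColumnOptions]) -> WriterOptions:
    """Return the effective options for one column."""
    if column_opts is None:
        return global_opts
    merged = {}
    for f in fields(WriterOptions):
        override = getattr(column_opts, f.name)
        # Keep the file-wide value unless the column sets its own.
        merged[f.name] = getattr(global_opts, f.name) if override is None else override
    return WriterOptions(**merged)
```

With this sketch, `resolve(WriterOptions(), ColumnOptions(compression="snappy"))` changes only the compression for that column and leaves the other settings at their file-wide values.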

Are there any user-facing changes?

The main difference is in the existing compression field, which now takes a str, as in datafusion, instead of a custom enum. The main advantage is that supporting future compression algorithms will not require updating the Python-side code.

Additionally, the default compression was changed from zstd(4) to zstd(3), the same as datafusion.
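
As a rough illustration of why a string such as zstd(3) is more future-proof than an enum, the sketch below splits a compression string into a codec name and an optional level without knowing the set of codecs in advance. The `parse_compression` helper and its regular expression are hypothetical and written for this example; they are not the parser datafusion uses, and the accepted syntax is an assumption.

```python
import re
from typing import Optional, Tuple

# Assumed syntax: a codec name, optionally followed by an integer level in parentheses.
_COMPRESSION_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_compression(spec: str) -> Tuple[str, Optional[int]]:
    """Split a string like 'zstd(3)' into ('zstd', 3); the level may be absent."""
    match = _COMPRESSION_RE.match(spec)
    if match is None:
        raise ValueError(f"invalid compression spec: {spec!r}")
    codec, level = match.groups()
    return codec.lower(), (int(level) if level is not None else None)
```

Because the codec name is passed through as text, a new algorithm added on the Rust side would need no new enum member in Python under this approach.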

Contributor

@timsaucer timsaucer left a comment

Overall, I really like the idea. Right now this includes a breaking change to a very popular user-facing function. If we take the suggestion to allow two function signatures, I think we'll be able to include this in the next release.

Contributor

@timsaucer timsaucer left a comment

Very nice. Thank you!

I might add a follow-up PR that overloads write_parquet so that it detects whether it was passed these options or the old signature. I don't think that's blocking for what you have here.
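
For illustration only, one way such an overload could dispatch on the argument type is sketched below. Everything here is hypothetical: the `WriterOptions` class stands in for the options object, the parameter names mimic an older compression-plus-level signature as an assumption, and the function returns the normalized options instead of writing a file so that the dispatch logic is easy to see.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class WriterOptions:
    """Hypothetical stand-in for the options object."""
    compression: str = "zstd(3)"


def write_parquet(path: str,
                  compression: Union[str, WriterOptions] = "zstd(3)",
                  compression_level: Optional[int] = None) -> WriterOptions:
    """Accept either an options object or the older string-plus-level form."""
    if isinstance(compression, WriterOptions):
        # New-style call: the options object carries everything.
        if compression_level is not None:
            raise ValueError("compression_level cannot be combined with an options object")
        return compression
    # Old-style call: fold the separate level into a single string.
    spec = compression if compression_level is None else f"{compression}({compression_level})"
    return WriterOptions(compression=spec)
```

In this sketch, existing callers that pass a codec name and a level keep working, while new callers pass a single options object in the same position.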

I think write_parquet_with_options would be a slightly more explicit function name, but also not blocking for this PR.

If you can resolve the merge conflicts, I'll rerun CI and if all goes through I can merge it in soon.

Thank you again!

@nuno-faria
Author

Conflicts are resolved. I also renamed the function to write_parquet_with_options.

@timsaucer
Contributor

Looks great. There are some minor ruff errors. After that it looks good to merge!

2 participants